
fix: unify ColumnNotFound for duckdb and pyspark #2493

Open · wants to merge 42 commits into main

Conversation

@EdAbati (Collaborator) commented May 4, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@EdAbati (Collaborator, Author) commented May 4, 2025

I think I can do some more clean-up of repetitive code. I'll try tomorrow morning.

@EdAbati marked this pull request as ready for review on May 5, 2025, 07:04
@EdAbati (Collaborator, Author) commented May 5, 2025

I made a follow-up PR #2495 with the cleanup :)

@MarcoGorelli (Member) left a comment

thanks for working on this! just got a comment on the .columns usage

@@ -186,7 +187,14 @@ def from_column_names(
         context: _FullContext,
     ) -> Self:
         def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
-            return [col(name) for name in evaluate_column_names(df)]
+            col_names = evaluate_column_names(df)
+            missing_columns = [c for c in col_names if c not in df.columns]
Member:

df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

I was hoping we could do something like we do for Polars. That is to say, when we do select / with_columns, we wrap them in try/except, and in the except block we intercept the error message to give a more useful / unified one
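
A rough sketch of that pattern for the DuckDB backend (illustrative only, not the PR's code; the helper name and the exact exception class DuckDB raises are assumptions). The nice part is that .columns is only consulted on the error path, where the extra cost doesn't matter:

import duckdb

from narwhals.exceptions import ColumnNotFoundError


def select_with_unified_error(
    rel: duckdb.DuckDBPyRelation, *column_names: str
) -> duckdb.DuckDBPyRelation:
    # Sketch: wrap the backend call and re-raise a unified error on failure.
    try:
        return rel.select(*column_names)
    except duckdb.BinderException as e:  # assumed to be what DuckDB raises for unknown columns
        # Only pay for `.columns` once we already know we're going to fail.
        available = rel.columns
        missing = [c for c in column_names if c not in available]
        raise ColumnNotFoundError(
            f"The following columns were not found: {missing}"
            f"\n\nHint: Did you mean one of these columns: {available}?"
        ) from e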

@EdAbati (Collaborator, Author), May 5, 2025:

Ah interesting, I was not aware πŸ˜•

What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more.)

Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like but I might be wrong

I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release)

Member:

> df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

@MarcoGorelli could we add that to (#805) and put more of a focus towards it? πŸ™

Member:

I don't think it's documented, but evaluating .columns may sometimes require doing a full scan. Example:

In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')
100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns Β± 7.62 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 ΞΌs Β± 3.8 ΞΌs per loop (mean Β± std. dev. of 7 runs, 1,000 loops each)

Granted, we don't have pivot in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to nw.from_native, and the output schema of pivot is value-dependent (😩 )

The same consideration should apply to spark-like
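
For reference, a self-contained sketch of the session above (assumes duckdb, numpy, and polars are installed; sizes are scaled down and exact timings will differ by machine):

import timeit

import duckdb
import numpy as np
import polars as pl

rng = np.random.default_rng()
df = pl.DataFrame(
    {
        "a": rng.integers(0, 10_000, 1_000_000),
        "b": rng.integers(0, 10_000, 1_000_000),
    }
)

rel = duckdb.sql("SELECT * FROM df")  # plain scan: schema is known statically
rel1 = duckdb.sql("PIVOT rel ON a")   # pivot: output schema depends on the values in 'a'

print(timeit.timeit(lambda: rel.columns, number=100))   # cheap metadata lookup
print(timeit.timeit(lambda: rel1.columns, number=100))  # may need to scan the data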

Member:

How do those timings compare to other operations/metadata lookups on the same tables?

Member:

.alias for example is completely non-value-dependent, so that stays fast

In [60]: %timeit rel.alias
342 ns Β± 2.3 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [61]: %timeit rel1.alias
393 ns Β± 2.6 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

@EdAbati added the pyspark, pyspark-connect, and error reporting labels on May 6, 2025
try:
    return self._with_native(self.native.select(*new_columns_list))
except AnalysisException as e:
    msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"
@EdAbati (Collaborator, Author):

Not 100% sure about this error message. I don't think we can access the missing column names at this level; am I missing something?

Member:

I think what you've written is great: even though we can't access them, we can still try to be helpful.

@EdAbati (Collaborator, Author):

I split the test into lazy and eager variants to simplify the if-else statements a bit. I hope it is a bit more readable?

    return df

if constructor_id == "polars[lazy]":
    msg = r"^e|\"(e|f)\""
@EdAbati (Collaborator, Author), May 9, 2025:

Before, it was msg = "e|f". Now it is a bit stricter.
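
For illustration (the sample messages below are made up, not actual backend output), the difference in strictness is easy to see with plain re.search, which is what pytest.raises(match=...) uses under the hood:

import re

pattern = r"^e|\"(e|f)\""

# Matches a Polars-lazy-style message that starts with the missing column name...
assert re.search(pattern, "e\n\nResolved plan until failure: ...")
# ...or a message that quotes the missing column.
assert re.search(pattern, 'column "f" not found')

# The old pattern "e|f" also matched completely unrelated text; the new one doesn't.
assert re.search("e|f", "some unrelated message")
assert re.search(pattern, "some unrelated message") is None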

Comment on lines +105 to +106
with pytest.raises(ColumnNotFoundError, match=msg):
    df.select(nw.col("fdfa"))
@EdAbati (Collaborator, Author):

Before, this was not tested for Polars.

    constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
) -> None:
    constructor_id = str(request.node.callspec.id)
    if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):
@EdAbati (Collaborator, Author), May 9, 2025:

sqlframe and pyspark.connect raise errors at collect. πŸ˜•

I need to double-check pyspark.connect. I currently cannot set it up locally... Working on it ⏳

Do you have an idea on how to deal with these?
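
One possible way to handle it in the test, sketched below (constructor_id, df, nw, and msg are the ones from the test above; matching on "fdfa" in the backend's own error text is an assumption):

import pytest

import narwhals as nw
from narwhals.exceptions import ColumnNotFoundError

if constructor_id in ("sqlframe", "pyspark[connect]"):
    # These backends only surface the bad column when the plan is executed,
    # so accept the backend's own exception at collection time.
    with pytest.raises(Exception, match="fdfa"):
        df.select(nw.col("fdfa")).collect()
else:
    # Everyone else raises the unified error as soon as the selection is built.
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(nw.col("fdfa"))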

@EdAbati changed the title from "fix: unify ColumnNotFound for duckdb and pyspark/sqlframe" to "fix: unify ColumnNotFound for duckdb and pyspark" on May 9, 2025
Comment on lines -115 to -120
df.drop(selected_columns, strict=True).collect()
else:
@EdAbati (Collaborator, Author):

drop should already be tested in drop_test.py.

Comment on lines +50 to +53
msg = (
    r"The following columns were not found: \[.*\]"
    r"\n\nHint: Did you mean one of these columns: \['a', 'b'\]?"
)
@EdAbati (Collaborator, Author):

We use parse_columns_to_drop, and therefore raise our ColumnNotFoundError.from_missing_and_available_column_names(missing_columns=missing_columns, available_columns=cols) for every backend.
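
Roughly, that check looks something like the sketch below (illustrative only, not the exact implementation; the function name is hypothetical):

from collections.abc import Sequence

from narwhals.exceptions import ColumnNotFoundError


def parse_columns_to_drop_sketch(
    columns: Sequence[str], available_columns: Sequence[str], *, strict: bool
) -> list[str]:
    # Validate up front so every backend raises the same ColumnNotFoundError,
    # instead of its own flavour of "column not found".
    if strict:
        missing_columns = [c for c in columns if c not in available_columns]
        if missing_columns:
            raise ColumnNotFoundError.from_missing_and_available_column_names(
                missing_columns=missing_columns, available_columns=available_columns
            )
        return list(columns)
    # A non-strict drop silently ignores columns that don't exist.
    return [c for c in columns if c in available_columns]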

@EdAbati (Collaborator, Author) commented May 23, 2025

I finally had a few minutes to go back to this πŸ₯²

A few notes:

  • catch_{}_exception made it cleaner, thanks for the tip
  • sqlframe is tricky because it raises at collect time. Also, the error will be different based on the backend. Do we have a way to know which type of sqlframe backend we are dealing with? For duckdb we could use catch_duckdb_exception, but we need to make sure duckdb is available. Any ideas? Could we think about it in a follow-up?
  • regarding simple_select, it should already be covered by the below. We are already testing it too:

    def select(
        self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
    ) -> Self:
        flat_exprs = tuple(flatten(exprs))
        if flat_exprs and all(isinstance(x, str) for x in flat_exprs) and not named_exprs:
            # fast path!
            try:
                return self._with_compliant(
                    self._compliant_frame.simple_select(*flat_exprs)
                )
            except Exception as e:
                # Column not found is the only thing that can realistically be raised here.
                available_columns = self.columns
                missing_columns = [x for x in flat_exprs if x not in available_columns]
                raise ColumnNotFoundError.from_missing_and_available_column_names(
                    missing_columns, available_columns
                ) from e

    Would you like to do something else for lazy backends?
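
For context, a hypothetical end-user view of what the unified behaviour aims for (a sketch only, not taken from the PR; duckdb is used just as an example backend):

import duckdb

import narwhals as nw
from narwhals.exceptions import ColumnNotFoundError

rel = duckdb.sql("SELECT 1 AS a, 2 AS b")
lf = nw.from_native(rel)

try:
    # Selecting a missing column should surface Narwhals' unified error,
    # not a backend-specific exception.
    lf.select("c").collect()
except ColumnNotFoundError as exc:
    print(exc)  # expected to mention 'c' and hint at the available columns ['a', 'b']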

@EdAbati (Collaborator, Author) left a comment

There are a couple of unrelated errors I'll check later.

See #2593 and #2594 (thanks @MarcoGorelli).

Labels
error reporting, pyspark, pyspark-connect
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error reporting: unify "column not found" error message for DuckDB / spark-like
3 participants